I have chosen to analyse the red wine data set to identify the key variables that determine wine quality. My aim is to use what I learn here to provide a (slightly) more sophisticated commentary on which red wines are likely to taste good, without actually tasting them!
In this data set, each record represents a type of red wine. The characteristics of each wine are provided (such as fixed acidity levels, pH etc.) and each wine is assigned a quality score (ranging from 0 to 10), based on sensory data.
The red wine data set is loaded and the first 10 rows are printed out below.
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.4 0.70 0.00 1.9 0.076
## 2 2 7.8 0.88 0.00 2.6 0.098
## 3 3 7.8 0.76 0.04 2.3 0.092
## 4 4 11.2 0.28 0.56 1.9 0.075
## 5 5 7.4 0.70 0.00 1.9 0.076
## 6 6 7.4 0.66 0.00 1.8 0.075
## 7 7 7.9 0.60 0.06 1.6 0.069
## 8 8 7.3 0.65 0.00 1.2 0.065
## 9 9 7.8 0.58 0.02 2.0 0.073
## 10 10 7.5 0.50 0.36 6.1 0.071
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## 7 15 59 0.9964 3.30 0.46 9.4
## 8 15 21 0.9946 3.39 0.47 10.0
## 9 9 18 0.9968 3.36 0.57 9.5
## 10 17 102 0.9978 3.35 0.80 10.5
## quality
## 1 5
## 2 5
## 3 5
## 4 6
## 5 5
## 6 5
## 7 5
## 8 7
## 9 7
## 10 5
The variables are as follows.
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
Units for each variable are as follows:
1. fixed acidity (tartaric acid - g / dm^3)
2. volatile acidity (acetic acid - g / dm^3)
3. citric acid (g / dm^3)
4. residual sugar (g / dm^3)
5. chlorides (sodium chloride - g / dm^3)
6. free sulfur dioxide (mg / dm^3)
7. total sulfur dioxide (mg / dm^3)
8. density (g / cm^3)
9. pH
10. sulphates (potassium sulphate - g / dm^3)
11. alcohol (% by volume)
12. quality (score between 0 and 10, based on sensory data)
This data set has 1599 records across 13 variables.
## [1] 1599 13
The structure of the data is as follows.
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
A quick summary of the data set below.
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
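The loading and inspection steps above could be reproduced roughly as follows. This is a minimal sketch: the source does not show how the file is read, so the inline `csv_text` (built from the first three printed rows) stands in for the real CSV, and only the data frame name `rw` is taken from the output shown in this report.

```r
# Stand-in for the real CSV file (its path is not given in the source);
# these three rows are copied from the head() output above.
csv_text <- "X,fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol,quality
1,7.4,0.70,0.00,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5
2,7.8,0.88,0.00,2.6,0.098,25,67,0.9968,3.20,0.68,9.8,5
3,7.8,0.76,0.04,2.3,0.092,15,54,0.9970,3.26,0.65,9.8,5"

# With the real file, pass its path as the first argument instead of `text =`.
rw <- read.csv(text = csv_text)

print(head(rw, 10))  # first rows
print(names(rw))     # variable names
print(dim(rw))       # number of records and variables
str(rw)              # structure: type of each column
print(summary(rw))   # five-number summary plus mean per column
```

On the full data set the same calls would be expected to report 1599 rows and 13 columns, as shown above.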
A quick look at the histogram for each variable is given below, with summary statistics marked (min = 1st red line, q1 = 1st blue line, median = orange line, q3 = 2nd blue line, max = 2nd red line, mean = green line).
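A histogram with these reference lines can be sketched in base R as below. The plotting library used for the original figures is not stated, so this is an assumption; `plot_hist_with_stats` is a hypothetical helper name, and the ten sample values are the fixed acidity figures from the first rows printed earlier.

```r
# Hypothetical helper: histogram with vertical lines for the summary statistics,
# using the colour scheme described in the text.
plot_hist_with_stats <- function(x, label) {
  stats <- c(
    min    = min(x),
    q1     = unname(quantile(x, 0.25)),
    median = median(x),
    q3     = unname(quantile(x, 0.75)),
    max    = max(x),
    mean   = mean(x)
  )
  hist(x, breaks = 30, main = label, xlab = label)
  # One colour per statistic, in the same order as `stats`.
  abline(v = stats,
         col = c("red", "blue", "orange", "blue", "red", "green"),
         lwd = 2)
  invisible(stats)
}

# Fixed acidity of the first 10 wines shown earlier.
fixed_acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
stats <- plot_hist_with_stats(fixed_acidity, "fixed.acidity")
print(stats)
```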
Let’s look at each variable in turn.
Fixed acidity is not normally distributed. It has a long tail as there are some large acidity figures, although with low frequency.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
## [1] "Standard deviation is: 1.7410963181277"
The log10 plot looks closer to a normal distribution. The standard deviation is also smaller, although this mainly reflects the compressed log scale rather than normality itself.
## [1] "Standard deviation 1og10 scale is: 0.0773478330552048"
Volatile acidity is not normally distributed. It has a long tail as there are some large acidity figures, although with low frequency. It is more normally distributed than fixed acidity.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
## [1] "Standard deviation is: 0.179059704153535"
The log10 plot looks slightly closer to a normal distribution. The standard deviation is again smaller, which mainly reflects the compressed log scale.
## [1] "Standard deviation 1og10 scale is: 0.0499117519457995"
Citric acid has a long-tailed distribution, as with the previous 2 variables. It is not normally distributed, and the distribution is stretched by large acid levels for several red wines.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
## [1] "Standard deviation is: 0.194801137405319"
Attempts to normalise the distribution of citric acid reversed the direction of the long tail, from right skewed (bulk of observations at the low end) to left skewed (bulk of observations at the high end).
## [1] "Standard deviation 1og10 scale is: 0.0661964066049957"
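Citric acid has a minimum of 0, so a plain `log10()` would produce `-Inf` for those wines and no finite standard deviation. The finite value reported above suggests some offset or filtering was applied, but the source does not say which. The sketch below assumes `log10(x + 1)`, one common choice, and uses the citric acid values of the first 10 wines printed earlier.

```r
# Citric acid of the first 10 wines shown earlier (note the zeros).
citric_acid <- c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00, 0.02, 0.36)

sd_raw <- sd(citric_acid)

# Plain log10 maps every zero to -Inf, which breaks the standard deviation.
log_plain <- log10(citric_acid)
print(any(is.infinite(log_plain)))

# Assumed transformation: shift by 1 before taking the log so zeros stay finite.
log_shifted <- log10(citric_acid + 1)
sd_log <- sd(log_shifted)

print(paste("Standard deviation is:", sd_raw))
print(paste("Standard deviation log10(x + 1) scale is:", sd_log))
```

The standard deviation on the log scale is smaller simply because the log compresses the values; comparing histograms or Q-Q plots is a more direct check of normality than comparing the two standard deviations.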
The bulk of residual sugar observations sits at the lower end, with a lot of outliers. This results in a strongly right skewed, long-tailed distribution. We will attempt to normalise this using a log10 plot.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
## [1] "Standard deviation is: 1.40992805950728"
The log10 plot is more normally distributed; however, it is still quite right skewed.
## [1] "Standard deviation 1og10 scale is: 0.11724613311005"
Similar to residual sugar, the bulk of chlorides observations sits at the lower end, with a lot of outliers. This results in a strongly right skewed, long-tailed distribution. We will attempt to normalise this using a log10 plot.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
## [1] "Standard deviation is: 0.0470653020100901"
The log10 plot is more normally distributed, but is still rather right skewed.
## [1] "Standard deviation 1og10 scale is: 0.0169336059155253"
Free sulfur dioxide is quite similar in distribution to chlorides and residual sugar, although it is less extreme. It is right skewed, has sizeable outliers and is not normally distributed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
## [1] "Standard deviation is: 10.4601569698097"
Normalising with a log10 plot has worked, giving a more even distribution of observations and a greatly reduced outlier count.
## [1] "Standard deviation 1og10 scale is: 0.270908515351267"
The distribution of total sulfur dioxide is very similar to free sulfur dioxide.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
## [1] "Standard deviation is: 32.8953244782991"
Using a log10 plot, we were able to normalise the distribution and greatly reduce the number of outliers, like in the previous variable.
## [1] "Standard deviation 1og10 scale is: 0.296438379192448"
Density looks normally distributed, with some outliers. Most observations are concentrated between 0.995 and 1. The mean and median are nearly identical.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
## [1] "Standard deviation is: 0.00188733395384256"
Using a log10 plot has made this variable even more normally distributed.
## [1] "Standard deviation 1og10 scale is: 0.00041048390118351"
Much like density, pH is close to normally distributed. The mean and median are nearly identical, and pH values are mostly concentrated between 3 and 3.5, with some outliers. 75% of wines have a pH of 3.4 or less (quite acidic).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
## [1] "Standard deviation is: 0.154386464903543"
Using a log10 plot has made this variable even more normally distributed.
## [1] "Standard deviation 1og10 scale is: 0.0155302548325435"
Similar to total and free sulfur dioxide, sulphates is right skewed, with the bulk of observations at the lower end of the sulphates scale. It is long tailed and not normally distributed, with a substantial number of large, infrequent outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
## [1] "Standard deviation is: 0.16950697959011"
Using a log10 scale, the distribution of sulphates looks more symmetric, with a marked reduction in the number of outliers. The mean and median are closer together.
## [1] "Standard deviation 1og10 scale is: 0.0407069650816158"
Alcohol is also right skewed, with the bulk of observations at the lower end of the alcohol scale. 75% of wines have an alcohol content of 11.1% or less.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
## [1] "Standard deviation is: 1.06566758184739"
Using a log10 scale, alcohol is more normally distributed, with minimal outliers remaining.
## [1] "Standard deviation 1og10 scale is: 0.0392750529614007"
Quality is roughly normally distributed, with a slight left skew. At least 50% of all wines analysed have a score of 5 or 6, with 8 being the highest score. This means that this analysis will be based mostly on average quality wines.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
## [1] "Standard deviation is: 0.807569439734705"
Total acidity is a calculated variable (fixed acidity + volatile acidity). As a combination of both fixed and volatile acidity, it has a similar distribution to them.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.120 7.680 8.445 8.847 9.740 16.285
## [1] "Standard deviation is: 1.70404692804485"
Using a log10 scale, total acidity is more normally distributed, with fewer outliers remaining.
## [1] "Standard deviation 1og10 scale is: 0.0718503550795018"
The dataset has 1599 different wines, with characteristics recorded across 13 variables. All the variables are numerical in nature.
Quality is of greatest interest, and it forms the anchor for the upcoming analysis. My aim is to determine how the other variables drive the quality of wine up or down.
From past experience and some quick research, it seems that features such as alcohol content, pH, residual sugar and acid content give wines distinct tastes and textures (which, when combined and tasted, elicit either positive or negative responses).
A new variable called total acidity was created (Fixed acidity + Volatile acidity). This is done so that we have a general acidity variable to analyse. Citric acid is not included in this new variable.
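Creating the derived variable is a one-line column addition. The sketch below uses three wines from the rows printed earlier; the column name `total.acidity` matches the name that appears in the test output later in this report.

```r
# Three wines taken from the first rows printed earlier.
rw <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 11.2),
  volatile.acidity = c(0.70, 0.88, 0.28)
)

# Total acidity as defined in the text: fixed + volatile (citric acid excluded).
rw$total.acidity <- rw$fixed.acidity + rw$volatile.acidity

print(rw)
print(summary(rw$total.acidity))
```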
Based on the distribution of quality, our analyses will be based mostly on wines with an average quality rating (between 5 and 6). This will make it difficult to determine what combination of levels within each variable makes very good, or very bad scoring wines.
Quite a few of the variables were not normally distributed, contrary to expectations. There were extreme outliers for some of these variables. In most cases where the distribution is long tailed, a log10 scale adjustment is used to make the observations more normally distributed.
Firstly, we look at correlation between the variables. Quality has relatively strong correlation with alcohol (positive correlation of 0.476) and volatile acidity (negative correlation of -0.391).
Quality has a slight but notable correlation with citric acid (positive correlation of 0.226) and sulphates (positive correlation of 0.251).
Initial expectations that pH and residual sugar would impact quality will need to be revisited as both have very low correlation with quality.
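A correlation overview of this kind can be produced with `cor()`, which computes Pearson correlations by default. The sketch below is illustrative only: it uses a handful of columns from the first 10 wines printed earlier, so its numbers will not match the full-data correlations quoted above.

```r
# A few columns from the first 10 wines shown earlier.
rw <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.80),
  pH               = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pairwise Pearson correlations between all columns.
cor_mat <- cor(rw)
print(round(cor_mat, 3))

# Correlation of each predictor with quality, strongest (by absolute value) first.
quality_cor <- cor_mat[, "quality"]
quality_cor <- quality_cor[names(quality_cor) != "quality"]
print(quality_cor[order(abs(quality_cor), decreasing = TRUE)])
```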
We will look at the interaction between quality of wines and each of the potential predictor variables below.
The following applies to the plots below:
- Median: orange line
- Mean: green line
- Linear smoothing: dashed blue line
Fixed acidity and quality are slightly positively correlated (cor = 0.124). The mean and median of fixed acidity tend to rise with quality, although not monotonically: wines scoring 8 have lower fixed acidity than wines scoring 7.
The log10 plots show the same pattern, albeit with fewer outliers.
##
## Pearson's product-moment correlation
##
## data: rw$fixed.acidity and rw$quality
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.07548957 0.17202667
## sample estimates:
## cor
## 0.1240516
## rw$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.700 7.150 7.500 8.360 9.875 11.600
## --------------------------------------------------------
## rw$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.600 6.800 7.500 7.779 8.400 12.500
## --------------------------------------------------------
## rw$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.000 7.100 7.800 8.167 8.900 15.900
## --------------------------------------------------------
## rw$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.700 7.000 7.900 8.347 9.400 14.300
## --------------------------------------------------------
## rw$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.900 7.400 8.800 8.872 10.100 15.600
## --------------------------------------------------------
## rw$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.000 7.250 8.250 8.567 10.225 12.600
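The two blocks of output above have the shape produced by `cor.test()` and by `by()` combined with `summary()`. The exact calls are not shown in the source, so the sketch below is an assumption based on that output, run on the first 10 wines printed earlier rather than the full data set.

```r
# Fixed acidity and quality of the first 10 wines shown earlier.
rw <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  quality       = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pearson correlation test (the default method), with t statistic,
# degrees of freedom (n - 2), p-value and 95% confidence interval.
ct <- cor.test(rw$fixed.acidity, rw$quality)
print(ct)

# Summary statistics of fixed acidity within each quality score.
by_quality <- by(rw$fixed.acidity, rw$quality, summary)
print(by_quality)
```

The degrees of freedom give a quick sanity check: with 1599 wines the test reports df = 1597, matching the output above, and with this 10-row sample it reports df = 8.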
Volatile acidity and quality are negatively correlated (cor = -0.391). Wines of better quality have a markedly lower mean and median volatile acidity content.
The relationship between volatile acidity and pH is discussed in a later section.
The log10 plots show the same pattern.
##
## Pearson's product-moment correlation
##
## data: rw$volatile.acidity and rw$quality
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4313210 -0.3482032
## sample estimates:
## cor
## -0.3905578
## rw$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4400 0.6475 0.8450 0.8845 1.0100 1.5800
## --------------------------------------------------------
## rw$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.230 0.530 0.670 0.694 0.870 1.130
## --------------------------------------------------------
## rw$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.180 0.460 0.580 0.577 0.670 1.330
## --------------------------------------------------------
## rw$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1600 0.3800 0.4900 0.4975 0.6000 1.0400
## --------------------------------------------------------
## rw$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3000 0.3700 0.4039 0.4850 0.9150
## --------------------------------------------------------
## rw$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2600 0.3350 0.3700 0.4233 0.4725 0.8500
Citric acid and quality are positively correlated (cor = 0.226). Wines of better quality have a markedly higher mean and median citric acid content.
The relationship between citric acid and pH is discussed in a later section.
The log10 plots show the same pattern.
##
## Pearson's product-moment correlation
##
## data: rw$citric.acid and rw$quality
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1793415 0.2723711
## sample estimates:
## cor
## 0.2263725
## rw$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0050 0.0350 0.1710 0.3275 0.6600
## --------------------------------------------------------
## rw$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0300 0.0900 0.1742 0.2700 1.0000
## --------------------------------------------------------
## rw$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0900 0.2300 0.2437 0.3600 0.7900
## --------------------------------------------------------
## rw$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0900 0.2600 0.2738 0.4300 0.7800
## --------------------------------------------------------
## rw$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.3050 0.4000 0.3752 0.4900 0.7600
## --------------------------------------------------------
## rw$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0300 0.3025 0.4200 0.3911 0.5300 0.7200
Residual sugar and quality have a negligible positive correlation (cor = 0.0137, p-value = 0.58). Most wines have a residual sugar content of up to about 3.1 g / dm^3. This variable has minimal direct impact on quality of wine.
However, we should consider the interaction between density and residual sugar (covered in the Quality & Density section), which suggests it may have an indirect impact on quality of wine.
The log10 plots show the same pattern.
##
## Pearson's product-moment correlation
##
## data: rw$residual.sugar and rw$quality
## t = 0.5488, df = 1597, p-value = 0.5832
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03531327 0.06271056
## sample estimates:
## cor
## 0.01373164
## rw$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.200 1.875 2.100 2.635 3.100 5.700
## --------------------------------------------------------
## rw$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.300 1.900 2.100 2.694 2.800 12.900
## --------------------------------------------------------
## rw$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.200 1.900 2.200 2.529 2.600 15.500
## --------------------------------------------------------
## rw$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.477 2.500 15.400
## --------------------------------------------------------
## rw$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.200 2.000 2.300 2.721 2.750 8.900
## --------------------------------------------------------
## rw$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.400 1.800 2.100 2.578 2.600 6.400
Chlorides and quality are negatively correlated (cor = -0.129). Wines of better quality have a lower mean and median chloride content. The bulk of the wines have a chloride content between 0 and 0.1.
The log10 plots show the same pattern.
##
## Pearson's product-moment correlation
##
## data: rw$chlorides and rw$quality
## t = -5.1948, df = 1597, p-value = 2.313e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.17681041 -0.08039344
## sample estimates:
## cor
## -0.1289066
## rw$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0610 0.0790 0.0905 0.1225 0.1430 0.2670
## --------------------------------------------------------
## rw$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.04500 0.06700 0.08000 0.09068 0.08900 0.61000
## --------------------------------------------------------
## rw$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.03900 0.07400 0.08100 0.09274 0.09400 0.61100
## --------------------------------------------------------
## rw$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.03400 0.06825 0.07800 0.08496 0.08800 0.41500
## --------------------------------------------------------
## rw$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.06200 0.07300 0.07659 0.08700 0.35800
## --------------------------------------------------------
## rw$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.04400 0.06200 0.07050 0.06844 0.07550 0.08600
Free sulfur dioxide and quality have a very small, negligible negative correlation (cor = -0.0507). This variable has minimal impact on quality of wine. Interestingly, the lower quality wines (with scores of 3, 4) and the higher quality wines (with scores of 7, 8) have low levels of free sulfur dioxide. Free sulfur dioxide prevents microbial growth and wine oxidation and is added to wine to improve its aging potential.
The log10 plots show the same pattern with fewer outliers.
##
## Pearson's product-moment correlation
##
## data: rw$free.sulfur.dioxide and rw$quality
## t = -2.0269, df = 1597, p-value = 0.04283
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.099430290 -0.001638987
## sample estimates:
## cor
## -0.05065606
## rw$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.0 5.0 6.0 11.0 14.5 34.0
## --------------------------------------------------------
## rw$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 6.00 11.00 12.26 15.00 41.00
## --------------------------------------------------------
## rw$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 9.00 15.00 16.98 23.00 68.00
## --------------------------------------------------------
## rw$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 8.00 14.00 15.71 21.00 72.00
## --------------------------------------------------------
## rw$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 6.00 11.00 14.05 18.00 54.00
## --------------------------------------------------------
## rw$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 6.00 7.50 13.28 16.50 42.00
Similar to free sulfur dioxide, total sulfur dioxide and quality have a small negative correlation (cor = -0.185). This variable has a small impact on quality of wine. Interestingly, the lower quality wines (with scores of 3, 4) and the higher quality wines (with scores of 7, 8) have low levels of total sulfur dioxide.
The log10 plots show the same pattern with fewer outliers.
##
## Pearson's product-moment correlation
##
## data: rw$total.sulfur.dioxide and rw$quality
## t = -7.5271, df = 1597, p-value = 8.622e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2320162 -0.1373252
## sample estimates:
## cor
## -0.1851003
## rw$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 12.5 15.0 24.9 42.5 49.0
## --------------------------------------------------------
## rw$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 14.00 26.00 36.25 49.00 119.00
## --------------------------------------------------------
## rw$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 26.00 47.00 56.51 84.00 155.00
## --------------------------------------------------------
## rw$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 23.00 35.00 40.87 54.00 165.00
## --------------------------------------------------------
## rw$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 17.50 27.00 35.02 43.00 289.00
## --------------------------------------------------------
## rw$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.00 16.00 21.50 33.44 43.00 88.00
Density and quality are negatively correlated (cor = -0.175). Density has a negative correlation with alcohol content (cor = -0.496) and a positive correlation with residual sugar content (cor = 0.355).
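Wines of better quality have a lower mean and median density. Given the correlations above, this is consistent with higher quality wines having higher alcohol content. Residual sugar also moves with density, but it shows almost no direct correlation with quality (cor = 0.0137).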
The log10 plots show the same pattern.
##
## Pearson's product-moment correlation
##
## data: rw$density and rw$quality
## t = -7.0997, df = 1597, p-value = 1.875e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2220365 -0.1269870
## sample estimates:
## cor
## -0.1749192
## rw$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9947 0.9961 0.9976 0.9975 0.9988 1.0008
## --------------------------------------------------------
## rw$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9934 0.9957 0.9965 0.9965 0.9974 1.0010
## --------------------------------------------------------
## rw$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9926 0.9962 0.9970 0.9971 0.9979 1.0031
## --------------------------------------------------------
## rw$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9954 0.9966 0.9966 0.9979 1.0037
## --------------------------------------------------------
## rw$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9906 0.9948 0.9958 0.9961 0.9974 1.0032
## --------------------------------------------------------
## rw$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9908 0.9942 0.9949 0.9952 0.9972 0.9988
pH and quality have a very small negative correlation (cor = -0.0577). Wines generally have pH levels of 3.2 to 3.5. This variable has a small direct impact on quality of wine.
However, we should consider the interaction between pH and both citric acid and volatile acidity, which suggests pH may have an indirect link to quality. pH has a negative correlation with citric acid (cor = -0.542) and a positive correlation with volatile acidity (cor = 0.235). Higher quality wines have lower volatile acidity and higher citric acid content, and both patterns point towards a lower pH.
The log10 plots show the same pattern.
##
## Pearson's product-moment correlation
##
## data: rw$pH and rw$quality
## t = -2.3109, df = 1597, p-value = 0.02096
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.106451268 -0.008734972
## sample estimates:
## cor
## -0.05773139
## rw$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.160 3.312 3.390 3.398 3.495 3.630
## --------------------------------------------------------
## rw$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.300 3.370 3.382 3.500 3.900
## --------------------------------------------------------
## rw$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.880 3.200 3.300 3.305 3.400 3.740
## --------------------------------------------------------
## rw$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.860 3.220 3.320 3.318 3.410 4.010
## --------------------------------------------------------
## rw$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.920 3.200 3.280 3.291 3.380 3.780
## --------------------------------------------------------
## rw$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.880 3.163 3.230 3.267 3.350 3.720
Sulphates and quality are positively correlated (cor = 0.251).
Sulphates contribute to sulfur dioxide gas, which prevents microbial growth and wine oxidation, and are added to wine to improve its aging potential. Given this, I expected a high correlation between sulphates and both total and free sulfur dioxide levels; however, the correlation in both cases is negligible.
Wines of better quality have a markedly higher mean and median sulphate content. This may help the wines age better, and thus taste better.
The log10 plots show the same pattern.
##
## Pearson's product-moment correlation
##
## data: rw$sulphates and rw$quality
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2049011 0.2967610
## sample estimates:
## cor
## 0.2513971
## rw$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4000 0.5125 0.5450 0.5700 0.6150 0.8600
## --------------------------------------------------------
## rw$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.4900 0.5600 0.5964 0.6000 2.0000
## --------------------------------------------------------
## rw$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.370 0.530 0.580 0.621 0.660 1.980
## --------------------------------------------------------
## rw$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4000 0.5800 0.6400 0.6753 0.7500 1.9500
## --------------------------------------------------------
## rw$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3900 0.6500 0.7400 0.7413 0.8300 1.3600
## --------------------------------------------------------
## rw$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.6300 0.6900 0.7400 0.7678 0.8200 1.1000
Alcohol content and quality are positively correlated (cor = 0.476). Wines of better quality have a markedly higher mean and median alcohol content. Most of the wines with quality score of 7+ have alcohol content of 10.8% and above.
Higher alcohol content also means lower density (i.e. a lighter feel) to the wine, given that both variables are negatively correlated (cor = -0.496).
The log10 plots show the same pattern.
##
## Pearson's product-moment correlation
##
## data: rw$alcohol and rw$quality
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4373540 0.5132081
## sample estimates:
## cor
## 0.4761663
## rw$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.400 9.725 9.925 9.955 10.575 11.000
## --------------------------------------------------------
## rw$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.00 9.60 10.00 10.27 11.00 13.10
## --------------------------------------------------------
## rw$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.5 9.4 9.7 9.9 10.2 14.9
## --------------------------------------------------------
## rw$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.80 10.50 10.63 11.30 14.00
## --------------------------------------------------------
## rw$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.20 10.80 11.50 11.47 12.10 14.00
## --------------------------------------------------------
## rw$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.80 11.32 12.15 12.09 12.88 14.00
Total acidity and quality have a very small positive correlation (cor = 0.0857). This variable has minimal impact on quality of wine.
The log10 plots show the same pattern, with fewer outliers.
##
## Pearson's product-moment correlation
##
## data: rw$total.acidity and rw$quality
## t = 3.4378, df = 1597, p-value = 0.0006015
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.03684298 0.13416675
## sample estimates:
## cor
## 0.08570932
## rw$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.460 8.051 8.883 9.245 10.460 12.180
## --------------------------------------------------------
## rw$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.120 7.380 8.185 8.473 9.070 12.960
## --------------------------------------------------------
## rw$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.520 7.735 8.390 8.744 9.490 16.260
## --------------------------------------------------------
## rw$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.300 7.605 8.400 8.845 9.881 14.610
## --------------------------------------------------------
## rw$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.320 7.880 9.110 9.276 10.485 16.285
## --------------------------------------------------------
## rw$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.420 7.625 8.730 8.990 10.530 12.910
The following variables are deemed relevant and could be useful in predicting the quality of wines.
I found that alcohol and volatile acidity have a sufficiently strong relationship with quality. To a lesser extent, sulphates, citric acid, density and chlorides are also related to quality. All the listed variables will be useful as predictors of wine quality in a predictive model.
These features were chosen as they had a sufficient correlation with wine quality.
I am keen to explore the relationship between residual sugar and density, and also the relationship between pH and citric acid, and how it could impact quality too.
Interestingly, although volatile acidity and citric acid are flagged as useful predictors, fixed and total acidity do not share the same relationships with wine quality.
Sulphates is useful as a wine quality predictor, but the same is not true for free and total sulfur dioxide.
The strongest relationship was between alcohol levels and wine quality, with a correlation of 0.476.
The approach I will take here is to focus first on the variables with the highest absolute correlation with quality.
For each of these focus variables, I will then pair it with the other (non-quality) variables with which it has an absolute correlation of at least 0.20.
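The screening approach described above can be sketched with a correlation matrix. This is only an illustration: the data frame is simulated, the column names are a small subset chosen for the example, and treating 0.20 as a minimum absolute correlation is an assumption about how the cut-off was applied.

```r
# Simulated stand-in data (assumption: the real analysis uses `rw`).
set.seed(1)
n <- 300
alcohol <- rnorm(n, 10.4, 1)
density <- 1.002 - 0.0005 * alcohol + rnorm(n, sd = 0.001)
pH <- rnorm(n, 3.3, 0.15)
quality <- round(5.6 + 0.4 * (alcohol - 10.4) + rnorm(n, sd = 0.7))
rw <- data.frame(alcohol, density, pH, quality)

cm <- cor(rw)

# Rank the candidate predictors by absolute correlation with quality
predictors <- setdiff(colnames(cm), "quality")
by_quality <- sort(abs(cm[predictors, "quality"]), decreasing = TRUE)
print(round(by_quality, 3))

# For the top-ranked focus variable, list the other predictors whose
# absolute correlation with it clears the threshold
threshold <- 0.20
focus <- names(by_quality)[1]
others <- setdiff(predictors, focus)
partners <- others[abs(cm[focus, others]) >= threshold]
print(partners)
```

Repeating the last step for each focus variable in turn gives the list of pairs to plot against quality.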
Alcohol has the strongest correlation with wine quality (cor = 0.476).
Alcohol has a correlation of -0.202 with volatile acidity. Higher wine quality has higher alcohol content and lower volatile acidity.
Differing wine qualities have a relatively clear distinction in volatile acidity levels.
Action: Add volatile acidity as predictor.
Alcohol has a correlation of -0.496 with density. Higher wine quality has higher alcohol content and lower density.
Differing wine qualities have a relatively clear distinction in density levels.
Action: Add density as predictor.
Alcohol has a correlation of -0.221 with chlorides. Higher wine quality has higher alcohol content and lower chlorides.
Differing wine qualities have a relatively clear distinction in chlorides levels.
Action: Add chlorides as predictor.
The plot is recreated with chlorides under a log10 scale. The patterns are more pronounced.
Alcohol has a correlation of 0.206 with pH. Higher-quality wines have higher alcohol content and relatively lower pH, although pH levels tend to increase as alcohol levels increase.
Differing wine qualities have a relatively clear distinction in pH levels.
Action: Add pH as predictor.
Volatile acidity has a sufficient correlation with wine quality (cor = -0.391).
Volatile acidity has a correlation of -0.261 with sulphates. Higher wine quality has lower volatile acidity and higher sulphates.
Differing wine qualities have a relatively clear distinction in sulphates levels.
Action: Add sulphates as predictor.
The plot is recreated with sulphates under a log10 scale. The patterns are more pronounced.
Volatile acidity has a correlation of -0.552 with citric acid. Higher wine quality has lower volatile acidity and slightly higher citric acid.
Differing wine qualities have a relatively clear distinction in citric acid levels.
Action: Add citric acid as predictor.
Volatile acidity has a correlation of 0.235 with pH. Higher wine quality has lower volatile acidity and slightly lower pH levels.
Differing wine qualities have a relatively clear distinction in pH levels.
Action: Add pH as predictor.
Sulphates has a sufficient correlation with wine quality (cor = 0.251).
Sulphates has a correlation of 0.313 with citric acid. Higher wine quality has higher sulphates and slightly higher citric acid levels.
Differing wine qualities have a relatively clear distinction in citric acid levels.
Action: Add citric acid as predictor.
Sulphates has a correlation of 0.371 with chlorides. Higher wine quality has higher sulphates and lower chlorides levels.
Differing wine qualities have a relatively clear distinction in chlorides levels.
Action: Add chlorides as predictor.
Citric acid has a sufficient correlation with wine quality (cor = 0.226).
Citric acid has a correlation of 0.365 with density. Higher wine quality has slightly higher citric acid levels and lower density.
Differing wine qualities have a relatively clear distinction in density levels.
Action: Add density as predictor.
Citric acid has a correlation of 0.204 with chlorides. In this pairing, wine quality does not vary much with citric acid levels, but higher-quality wines have lower chlorides.
Differing wine qualities have a relatively clear distinction in chlorides levels.
Action: Add chlorides as predictor.
Citric acid has a correlation of -0.542 with pH. In this pairing, wine quality does not vary much with citric acid levels, but higher-quality wines have slightly lower pH levels.
Differing wine qualities have a relatively clear distinction in pH levels.
Action: Add pH as predictor.
Density has a sufficient correlation with wine quality (cor = -0.175).
Density has a correlation of 0.201 with chlorides. Higher wine quality has lower density and lower chlorides.
Differing wine qualities have a relatively clear distinction in chlorides levels.
Action: Add chlorides as predictor.
Density has a correlation of -0.342 with pH. Higher wine quality has lower density and slightly lower pH levels.
Differing wine qualities have a relatively clear distinction in pH levels.
Action: Add pH as predictor.
Density has a correlation of 0.355 with residual sugar. Higher wine quality has lower density and seems to have slightly higher residual sugar content.
Differing wine qualities do not have a relatively clear distinction in residual sugar levels.
Action: Remove residual sugar as predictor.
Chlorides has a sufficient correlation with wine quality (cor = -0.129).
Chlorides has a correlation of -0.265 with pH. Higher wine quality has lower chlorides and has slightly lower pH levels.
Differing wine qualities have a relatively clear distinction in pH levels.
Action: Add pH as predictor.
Based on the previous analyses, I have decided to use the following variables as predictors of wine quality: - Alcohol, Volatile acidity, Sulphates, Density, Chlorides, pH, Citric acid
The following variable has been excluded as a predictor: - Residual sugar
##
## Calls:
## m1: lm(formula = I(quality) ~ I(alcohol), data = rw)
## m2: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity, data = rw)
## m3: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + sulphates,
## data = rw)
## m4: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + sulphates +
## density, data = rw)
## m5: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + sulphates +
## density + chlorides, data = rw)
## m6: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + sulphates +
## density + chlorides + pH, data = rw)
## m7: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + sulphates +
## density + chlorides + pH + citric.acid, data = rw)
##
## ======================================================================================================================
## m1 m2 m3 m4 m5 m6 m7
## ----------------------------------------------------------------------------------------------------------------------
## (Intercept) 1.875*** 3.095*** 2.611*** -4.820 -5.677 4.521 -5.885
## (0.175) (0.184) (0.196) (10.387) (10.335) (10.708) (11.930)
## I(alcohol) 0.361*** 0.314*** 0.309*** 0.316*** 0.300*** 0.305*** 0.321***
## (0.017) (0.016) (0.016) (0.018) (0.019) (0.019) (0.020)
## volatile.acidity -1.384*** -1.221*** -1.219*** -1.164*** -1.069*** -1.193***
## (0.095) (0.097) (0.097) (0.097) (0.101) (0.119)
## sulphates 0.679*** 0.664*** 0.857*** 0.848*** 0.848***
## (0.101) (0.103) (0.112) (0.112) (0.112)
## density 7.392 8.411 -0.506 10.281
## (10.331) (10.281) (10.561) (11.886)
## chlorides -1.653*** -1.929*** -1.782***
## (0.394) (0.401) (0.407)
## pH -0.419*** -0.534***
## (0.120) (0.134)
## citric.acid -0.269*
## (0.136)
## ----------------------------------------------------------------------------------------------------------------------
## R-squared 0.227 0.317 0.336 0.336 0.343 0.348 0.350
## adj. R-squared 0.226 0.316 0.335 0.334 0.341 0.346 0.347
## sigma 0.710 0.668 0.659 0.659 0.655 0.653 0.653
## F 468.267 370.379 268.912 201.751 166.599 141.816 122.332
## p 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -1721.057 -1621.814 -1599.384 -1599.127 -1590.346 -1584.293 -1582.343
## Deviance 805.870 711.796 692.105 691.882 684.325 679.163 677.509
## AIC 3448.114 3251.628 3208.768 3210.255 3194.692 3184.587 3182.686
## BIC 3464.245 3273.136 3235.654 3242.518 3232.332 3227.604 3231.080
## N 1599 1599 1599 1599 1599 1599 1599
## ======================================================================================================================
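The table above compares nested models that add one predictor at a time. A minimal base-R sketch of that pattern follows, using simulated data and only the first three predictors; the coefficients used to simulate `quality` are arbitrary assumptions, and the fit statistics are collected by hand rather than with whatever table-printing function produced the output above.

```r
# Simulated stand-in data (assumption: not the wine data).
set.seed(7)
n <- 500
rw <- data.frame(
  alcohol = rnorm(n, 10.4, 1.1),
  volatile.acidity = rnorm(n, 0.53, 0.18),
  sulphates = rnorm(n, 0.66, 0.15)
)
rw$quality <- 2.6 + 0.3 * rw$alcohol - 1.2 * rw$volatile.acidity +
  0.7 * rw$sulphates + rnorm(n, sd = 0.65)

# Add one predictor at a time, mirroring m1..m3 above
m1 <- lm(quality ~ alcohol, data = rw)
m2 <- update(m1, . ~ . + volatile.acidity)
m3 <- update(m2, . ~ . + sulphates)

# Collect the same fit statistics that the table reports
models <- list(m1 = m1, m2 = m2, m3 = m3)
fit_stats <- data.frame(
  r.squared = sapply(models, function(m) summary(m)$r.squared),
  adj.r.squared = sapply(models, function(m) summary(m)$adj.r.squared),
  AIC = sapply(models, AIC),
  BIC = sapply(models, BIC)
)
print(round(fit_stats, 3))
```

R-squared can never decrease when a predictor is added to a nested least-squares model, so adjusted R-squared, AIC and BIC are the more informative columns when deciding whether a new predictor earns its place. The `I()` wrappers in the printed calls leave plain columns such as `alcohol` unchanged, so they are omitted here.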
The original model (m7) has an R-squared value of 0.35, meaning that it can explain 35% of the variability in wine quality.
We can try to increase this as follows. First, density does not appear to add to the predictive power of the original model: its coefficient is not significant and R-squared stays at 0.336 when it is added in m4. Hence, it is removed from the revised model. Second, we will use log10 scales for sulphates and chlorides, as this reduces the skew and the influence of outliers in their distributions and may improve their predictive power.
The revised model has a slightly improved R-squared value of 0.36.
##
## Calls:
## m1: lm(formula = I(quality) ~ I(alcohol), data = rw)
## m2: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity, data = rw)
## m3: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + log10(sulphates),
## data = rw)
## m4: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + log10(sulphates) +
## citric.acid, data = rw)
## m5: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + log10(sulphates) +
## citric.acid + log10(chlorides), data = rw)
## m6: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + log10(sulphates) +
## citric.acid + log10(chlorides) + pH, data = rw)
##
## ========================================================================================================
## m1 m2 m3 m4 m5 m6
## --------------------------------------------------------------------------------------------------------
## (Intercept) 1.875*** 3.095*** 3.369*** 3.444*** 3.080*** 4.842***
## (0.175) (0.184) (0.184) (0.196) (0.218) (0.449)
## I(alcohol) 0.361*** 0.314*** 0.303*** 0.303*** 0.283*** 0.302***
## (0.017) (0.016) (0.016) (0.016) (0.017) (0.017)
## volatile.acidity -1.384*** -1.156*** -1.217*** -1.107*** -1.110***
## (0.095) (0.097) (0.112) (0.116) (0.115)
## log10(sulphates) 1.477*** 1.518*** 1.716*** 1.742***
## (0.177) (0.181) (0.188) (0.187)
## citric.acid -0.113 -0.013 -0.276*
## (0.103) (0.106) (0.121)
## log10(chlorides) -0.487*** -0.564***
## (0.132) (0.132)
## pH -0.595***
## (0.133)
## --------------------------------------------------------------------------------------------------------
## R-squared 0.227 0.317 0.345 0.346 0.352 0.360
## adj. R-squared 0.226 0.316 0.344 0.344 0.349 0.357
## sigma 0.710 0.668 0.654 0.654 0.651 0.647
## F 468.267 370.379 280.646 210.808 172.704 148.984
## p 0.000 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -1721.057 -1621.814 -1587.752 -1587.153 -1580.350 -1570.342
## Deviance 805.870 711.796 682.108 681.597 675.822 667.415
## AIC 3448.114 3251.628 3185.503 3186.306 3174.699 3156.683
## BIC 3464.245 3273.136 3212.389 3218.569 3212.339 3199.700
## N 1599 1599 1599 1599 1599 1599
## ========================================================================================================
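To illustrate why a log10 transform can help with a long-tailed predictor, here is a small sketch on simulated data. The predictor is drawn from a log-normal distribution and the response is built from its logarithm by construction, so the comparison shows the mechanics only and is not evidence about the wine data. The `skewness` helper is a hypothetical convenience function defined here, not part of base R.

```r
# Simulated right-skewed predictor (assumption: log-normal stand-in
# for a long-tailed variable such as sulphates).
set.seed(11)
n <- 500
rw <- data.frame(sulphates = exp(rnorm(n, log(0.62), 0.35)))
rw$quality <- 6 + 1.7 * log10(rw$sulphates) + rnorm(n, sd = 0.7)

# Sample skewness: close to 0 for a symmetric distribution
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw <- skewness(rw$sulphates)
skew_log <- skewness(log10(rw$sulphates))
print(c(raw = skew_raw, log10 = skew_log))

# Fit the same model with and without the transform and compare AIC
fit_raw <- lm(quality ~ sulphates, data = rw)
fit_log <- lm(quality ~ log10(sulphates), data = rw)
print(AIC(fit_raw, fit_log))
```

Both models have the same response, so their AIC values are directly comparable; a lower AIC indicates the better trade-off between fit and complexity.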
The strongest relationships were between wine quality and its two most strongly correlated variables, alcohol and volatile acidity. The plots show that wine quality increases with higher alcohol levels and lower volatile acidity.
Other variable combinations that strengthened each other were as follows: (alcohol and density), (alcohol and chlorides), (sulphates and volatile acidity), (density and citric acid). This is judged by the level of distinction between the different quality scores each combination of variables can predict.
Alcohol and density were an interesting interaction. I was surprised that density decreased when alcohol levels increased. Incidentally, higher alcohol levels and lower density are associated with better-scoring wines.
This suggests that drinkers like a lighter-textured, stronger drink!
However, based on our linear model, density does not appear to add any predictive power once alcohol is included as a predictor.
A linear model was created to predict wine quality. The revised model uses the following variables as predictors: Alcohol, Volatile acidity, log10(Sulphates), log10(Chlorides), pH, Citric acid.
It is able to explain 36% of variability in wine quality, which is not very high. This is likely caused by the relatively low correlations between predictors and wine quality.
The first plot illustrates that the majority of the wines in the data set have a quality score of 5 or 6 (i.e. average-quality wine). The first and third quartiles of the data fall at scores of 5 and 6, meaning at least 50% of the wines in the data are from these quality groups.
This improves our understanding of what levels of each variable make an average-quality wine.
If the data set were more even, with a much higher representation of very low (scores 3 to 4) and very high (scores 7 to 8) quality wines, we would be able to learn more about what levels of each variable make very good or very bad wines.
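The concentration of scores around 5 and 6 can be checked numerically as well as visually. The sketch below uses simulated scores whose sampling probabilities are invented for illustration; only the method (a frequency table, quartiles, and the share of mid-range scores) is meant to carry over to the real `quality` column.

```r
# Simulated quality scores (assumption: probabilities are made up so
# that most of the mass falls on 5 and 6, as in the plot described).
set.seed(3)
quality <- sample(3:8, 1000, replace = TRUE,
                  prob = c(0.01, 0.03, 0.43, 0.40, 0.12, 0.01))

# Share of wines at each score
counts <- table(quality)
shares <- prop.table(counts)
print(round(shares, 3))

# Quartiles of the score and the share of wines scoring 5 or 6
q <- quantile(quality, probs = c(0.25, 0.75))
mid_share <- mean(quality %in% c(5, 6))
print(q)
print(mid_share)
```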
The second plot shows the relationship between wine quality and its most strongly correlated variable, alcohol level.
There is a positive correlation where the higher the alcohol levels, the higher the quality of the wine. There is a clear increase in mean and median of alcohol levels as the quality of wine increases.
The third plot shows the combined relationship between alcohol levels, chloride levels and quality. The combination of these variables is important as it has a clear relationship with wine quality. The lower the chloride levels and the higher the alcohol content, the higher the quality of the wine.
This also suggests that drinkers tend to prefer stronger and less salty wines!
Together, these two variables have good predictive qualities for determining wine quality.
I ran into difficulty trying to decide the best way to visualise the data. I explored the many different plot options and struggled to choose a single plot, as I thought each of the plots helped improve understanding of the data set. I finally decided to use multiple plots to visualise a single variable. Box, scatter and line plots used in combination have been really helpful in this analysis.
I also ran into difficulty with the distribution of some of the variables (such as sulphates and chlorides). They had really long tails and were skewed, and these tend to hide patterns/relationships of these variables. I eventually overcame this by using a log10 scale on both these variables.
I found success and was happy to find some key predictors of wine quality such as alcohol levels, volatile acidity, sulphates, chlorides and citric acid. They have a good level of correlation with quality and, when used as predictors in a linear prediction model, they can explain 36% of the variability in wine quality. The flipside of this is that an expanded data set (as elaborated further below) could potentially uncover some stronger predictors and improve the prediction model.
Given the nature of the data set, where a majority of the wines analysed were of average quality, it reduces my confidence in the predictive power of any models built from this data set. It would be good to be able to expand this data set to include both very good and very bad wines.
It would be good to be able to expand on the data set from a quality rating perspective. Currently, we have a single quality score, which may be based on multiple criteria such as taste, texture and price, to name a few. It would be good to bring those additional variables (and potentially useful predictors) into the data set. This would open the door for us to analyse things like what drives wine prices and what combination of chemical properties results in different wine textures.
Overall, I feel that I know more about red wine and the different variables to pick up on when trying to determine what would be a good quality red wine.